Oct 17, 2022
By example:
# Initial imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
3 new packages:
Out-of-core (larger-than-memory) operations in Python
Extends the maximum file size from the size of memory to the size of your hard drive
Dask does not evaluate expressions immediately — rather, it stores a task graph of the necessary calculations.
# Create an array of normally-distributed random numbers
a = np.random.randn(1000)
a[:10]
array([ 2.60729614, 1.79697236, -0.73649402, 0.77187343, 1.05163172,
0.50237791, 0.71621133, -0.43416688, 0.04266553, -1.28853453])
# Multiply this array by a factor
b = a * 4
# Find the minimum value
b_min = b.min()
b_min
-14.195604074580396
Dask arrays mirror the numpy array syntax...but don't perform the calculation right away.
import dask.array as da
# Create a dask array from the above array
a_dask = da.from_array(a, chunks=200)
# Multiply this array by a factor
b_dask = a_dask * 4
# Find the minimum value
b_min_dask = b_dask.min()
print(b_min_dask)
dask.array<amin-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
We need to explicitly call the compute() function to evaluate the expression. We get the same result as the non-lazy numpy result.
b_min_dask.compute()
-14.195604074580396
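The chunks argument splits a dask array into blocks that are processed independently, one task per chunk. A minimal sketch (toy array, two chunks) shows the chunk metadata and that compute() matches the eager numpy result:

```python
import numpy as np
import dask.array as da

a = np.arange(10)
a_dask = da.from_array(a, chunks=5)  # two chunks of 5 elements each

print(a_dask.chunks)         # ((5, 5),)
result = (a_dask * 4).sum()  # lazy — builds a task graph, one branch per chunk
print(result.compute())      # 180, identical to (a * 4).sum()
```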
With Dask, DataFrames don't need to fit into memory — when evaluating an expression, dask loads the data into memory only when necessary.
See datasets.yml in this week's repository.
import intake
datasets = intake.open_catalog("./datasets.yml")
list(datasets)
['nyc_taxi_wide', 'census', 'osm']
datasets.osm
osm:
  args:
    engine: fastparquet
    urlpath: http://s3.amazonaws.com/datashader-data/osm-1billion.parq.zip
  description: ''
  driver: intake_parquet.source.ParquetSource
  metadata:
    cache:
    - argkey: urlpath
      decomp: zip
      regex: datashader-data
      regex_filter: .parq[/]?$
      type: compressed
  catalog_dir: /Users/nhand/Teaching/PennMUSA/Fall2022/week-7/./
to_dask()
osm_ddf = datasets.osm.to_dask()
osm_ddf
|                 | x       | y       |
|-----------------|---------|---------|
| npartitions=119 |         |         |
|                 | float32 | float32 |
|                 | ...     | ...     |
| ...             | ...     | ...     |
|                 | ...     | ...     |
|                 | ...     | ...     |
Only the data necessary to see the head of the file will be loaded.
# we can still get the head of the file quickly
osm_ddf.head(n=10)
|   | x | y |
|---|---|---|
| 0 | -16274360.0 | -17538778.0 |
| 1 | -16408889.0 | -16618700.0 |
| 2 | -16246231.0 | -16106805.0 |
| 3 | -19098164.0 | -14783157.0 |
| 4 | -17811662.0 | -13948767.0 |
| 5 | -17751736.0 | -13926740.0 |
| 6 | -17711376.0 | -13921245.0 |
| 7 | -17532738.0 | -13348323.0 |
| 8 | -19093358.0 | -10380358.0 |
| 9 | -19077458.0 | -10445329.0 |
All data partitions must be loaded for this calculation...it will take longer!
# getting the length means all of the data must be loaded though
nrows = len(osm_ddf)
print(f"number of rows = {nrows}")
number of rows = 1000050363
# mean x/y coordinates
mean_x = osm_ddf['x'].mean()
mean_y = osm_ddf['y'].mean()
print(mean_x, mean_y)
dd.Scalar<series-..., dtype=float32> dd.Scalar<series-..., dtype=float32>
# evaluate the expressions
print("mean x = ", mean_x.compute())
mean x = 2731828.836864097
# evaluate the expressions
print("mean y = ", mean_y.compute())
mean y = 5801830.125332437
Matplotlib struggles with anything more than a few hundred thousand points
"Turns even the largest data into images, accurately"
Datashader is a library that produces a "rasterized" image of large datasets, such that the visual color mapping is a fair representation of the underlying data.
Recommended reading: Understanding the datashader algorithm
# Datashader imports
import datashader as ds
import datashader.transfer_functions as tf
# Color-related imports
from datashader.colors import Greys9, viridis, inferno
from colorcet import fire
Set up the global datashader canvas:
# Web Mercator bounds
bound = 20026376.39
global_x_range = (-bound, bound)
global_y_range = (int(-bound*0.4), int(bound*0.6))
# Default width and height
global_plot_width = 900
global_plot_height = int(global_plot_width*0.5)
# Step 1: Setup the canvas
canvas = ds.Canvas(
plot_width=global_plot_width,
plot_height=global_plot_height,
x_range=global_x_range,
y_range=global_y_range,
)
# Step 2: Aggregate the points into pixels
# NOTE: Use the "count()" function — count number of points per pixel
agg = canvas.points(osm_ddf, "x", "y", agg=ds.count())
agg
<xarray.DataArray (y: 450, x: 900)>
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint32)
Coordinates:
* x (x) float64 -2e+07 -1.996e+07 -1.992e+07 ... 1.996e+07 2e+07
* y        (y) float64 -7.988e+06 -7.944e+06 ... 1.195e+07 1.199e+07
# Step 3: Perform the shade operation
img = tf.shade(agg, cmap=fire)
# Format: set the background of the image to black so it looks better
img = tf.set_background(img, "black")
img
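Conceptually, the aggregation step above is just a 2D histogram: each pixel stores the number of points that fall into it. A numpy-only sketch of the reduction that Canvas.points with ds.count() performs:

```python
import numpy as np

# Toy data: 1,000 random points
rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# Count points per pixel on a 90x45 grid — conceptually the same
# reduction as ds.Canvas(...).points(df, "x", "y", agg=ds.count())
counts, _, _ = np.histogram2d(x, y, bins=[90, 45])
print(int(counts.sum()))  # 1000 — every point lands in exactly one pixel
```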
Remember: our agg variable is an xarray DataArray.
So, we can leverage xarray's builtin where() function to select a subsample of the pixels based on the pixel counts.
selected = agg.where(agg > 15)
selected
<xarray.DataArray (y: 450, x: 900)>
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
* x (x) float64 -2e+07 -1.996e+07 -1.992e+07 ... 1.996e+07 2e+07
* y        (y) float64 -7.988e+06 -7.944e+06 ... 1.195e+07 1.199e+07
where condition — masked pixels are set to NaN
# plot the masked data
tf.set_background(tf.shade(selected, cmap=fire), "black")
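On a toy DataArray, where() keeps the values satisfying the condition and replaces the rest with NaN (which tf.shade then renders as transparent):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.array([[1.0, 20.0], [30.0, 4.0]]), dims=("y", "x"))
masked = arr.where(arr > 15)  # pixels failing the condition become NaN
print(masked.values)
# [[nan 20.]
#  [30. nan]]
```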
We can use datashader to visualize all 300 million census dots produced as part of the Cooper Center racial dot map.
# Load the data
# REMEMBER: this will take some time to download the first time
census_ddf = datasets.census.to_dask()
census_ddf.head()
|   | easting | northing | race |
|---|---|---|---|
| 0 | -12418767.0 | 3697425.00 | h |
| 1 | -12418512.0 | 3697143.50 | h |
| 2 | -12418245.0 | 3697584.50 | h |
| 3 | -12417703.0 | 3697636.75 | w |
| 4 | -12418120.0 | 3697129.25 | h |
print("number of rows =", len(census_ddf))
number of rows = 306675004
Important: datashader has a utility function to convert from latitude/longitude (EPSG=4326) to Web Mercator (EPSG=3857)
See: lnglat_to_meters()
from datashader.utils import lnglat_to_meters
# Sensible lat/lng coordinates for U.S. cities
# NOTE: these are in lat/lng so EPSG=4326
USA = [(-124.72, -66.95), (23.55, 50.06)]
Chicago = [( -88.29, -87.30), (41.57, 42.00)]
NewYorkCity = [( -74.39, -73.44), (40.51, 40.91)]
LosAngeles = [(-118.53, -117.81), (33.63, 33.96)]
Houston = [( -96.05, -94.68), (29.45, 30.11)]
Austin = [( -97.91, -97.52), (30.17, 30.37)]
NewOrleans = [( -90.37, -89.89), (29.82, 30.05)]
Atlanta = [( -84.88, -84.04), (33.45, 33.84)]
Philly = [( -75.28, -74.96), (39.86, 40.14)]
# Get USA xlim and ylim in meters (EPSG=3857)
USA_xlim_meters, USA_ylim_meters = [list(r) for r in lnglat_to_meters(USA[0], USA[1])]
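Under the hood this is just the spherical Web Mercator projection. A hand-rolled scalar version (a hypothetical stand-in for datashader's lnglat_to_meters, assuming the standard EPSG:3857 Earth radius) gives the same kind of numbers:

```python
import math

def lnglat_to_meters_sketch(lng, lat):
    """Spherical Web Mercator (EPSG:3857) — hypothetical scalar
    stand-in for datashader.utils.lnglat_to_meters."""
    R = 6378137.0  # Earth radius used by EPSG:3857, in meters
    x = math.radians(lng) * R
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * R
    return x, y

# Roughly Philadelphia: x ≈ -8.37e6 m, y ≈ 4.86e6 m
x, y = lnglat_to_meters_sketch(-75.16, 39.95)
print(round(x), round(y))
```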
# Define a default plot width & height
plot_width = 900
plot_height = int(plot_width*7.0/12)
# Step 1: Setup the canvas
cvs = ds.Canvas(
plot_width, plot_height, x_range=USA_xlim_meters, y_range=USA_ylim_meters
)
# Step 2: Aggregate the x/y points
agg = cvs.points(census_ddf, "easting", "northing")
# Step 3: Shade with a "Grey" colormap and "linear" colormapping
img = tf.shade(agg, cmap=Greys9, how="linear")
# Format: Set the background
tf.set_background(img, "black")
# Step 3: Shade with a "Grey" colormap and "log" colormapping
img = tf.shade(agg, cmap=Greys9, how='log')
# Format: add a black background
img = tf.set_background(img, 'black')
img
"A collection of perceptually accurate colormaps"
See: https://colorcet.holoviz.org/
# Step 3: Shade with "fire" color scale and "log" colormapping
img = tf.shade(agg, cmap=fire, how='log')
tf.set_background(img, 'black')
# Step 3: Shade with fire colormap and equalized histogram mapping
img = tf.shade(agg, cmap=fire, how='eq_hist')
tf.set_background(img, 'black')
img = tf.shade(agg, cmap=viridis, how='eq_hist')
img = tf.set_background(img, 'black')
img
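The eq_hist option maps pixel values by rank rather than by raw magnitude, so a handful of extremely dense pixels can't wash out the rest of the colormap. A toy version of the idea (not datashader's actual implementation, which bins the data and handles ties):

```python
import numpy as np

def eq_hist_sketch(values):
    # Rank-based mapping: each pixel's output level is its rank,
    # rescaled to [0, 1], so pixels spread evenly across the colormap
    # regardless of how skewed the raw counts are
    flat = values.ravel().astype(float)
    ranks = flat.argsort().argsort()
    return (ranks / (flat.size - 1)).reshape(values.shape)

counts = np.array([[0, 1], [10, 1000]])
out = eq_hist_sketch(counts)
print(out)  # rows map to [0, 1/3] and [2/3, 1]
```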
Use the export_image() function.
from datashader.utils import export_image
export_image(img, 'usa_census_viridis')
Datashader can use different colors for different categories of data.
census_ddf.head()
|   | easting | northing | race |
|---|---|---|---|
| 0 | -12418767.0 | 3697425.00 | h |
| 1 | -12418512.0 | 3697143.50 | h |
| 2 | -12418245.0 | 3697584.50 | h |
| 3 | -12417703.0 | 3697636.75 | w |
| 4 | -12418120.0 | 3697129.25 | h |
census_ddf['race'].value_counts().compute()
w    196052887
h     50317503
b     37643995
a     13914371
o      8746248
Name: race, dtype: int64
color_key = {"w": "aqua", "b": "lime", "a": "red", "h": "fuchsia", "o": "yellow"}
def create_census_image(longitude_range, latitude_range, w=plot_width, h=plot_height):
"""
A function for plotting the Census data, coloring pixel by race values.
"""
# Step 1: Calculate x and y range from lng/lat ranges
x_range, y_range = lnglat_to_meters(longitude_range, latitude_range)
# Step 2: Setup the canvas
canvas = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
# Step 3: Aggregate, but this time count the "race" category
# NEW: specify the aggregation method to count the "race" values in each pixel
agg = canvas.points(census_ddf, "easting", "northing", ds.count_cat("race"))
# Step 4: Shade, using our custom color map
img = tf.shade(agg, color_key=color_key, how="eq_hist")
# Return image with black background
return tf.set_background(img, "black")
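What ds.count_cat("race") produces per pixel is a vector of per-category counts rather than a single number. A pandas sketch of the same reduction on a toy frame (hypothetical "pixel" labels for illustration):

```python
import pandas as pd

# Toy stand-in for ds.count_cat("race"): per-pixel counts of each category
df = pd.DataFrame({
    "pixel": [0, 0, 0, 1, 1],
    "race":  ["w", "w", "b", "h", "h"],
})
counts = df.groupby(["pixel", "race"]).size().unstack(fill_value=0)
print(counts)  # pixel 0: b=1, h=0, w=2; pixel 1: b=0, h=2, w=0
```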
Color pixel values according to the demographics data in each pixel.
create_census_image(USA[0], USA[1])
create_census_image(Philly[0], Philly[1], w=600, h=600)
Hint: use the bounding boxes provided earlier to explore racial patterns across various major cities
create_census_image(NewYorkCity[0], NewYorkCity[1])
create_census_image(Atlanta[0], Atlanta[1])
create_census_image(LosAngeles[0], LosAngeles[1])
create_census_image(Houston[0], Houston[1])
create_census_image(Chicago[0], Chicago[1])
create_census_image(NewOrleans[0], NewOrleans[1])
We can use xarray to slice the array of aggregated pixel values to examine specific aspects of the data.
Use the sel() function of the xarray array
# Step 1: Setup canvas
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height)
# Step 2: Aggregate and count race category
aggc = cvs.points(census_ddf, "easting", "northing", agg=ds.count_cat("race"))
# NEW: Select only African Americans (where "race" column is equal to "b")
agg_b = aggc.sel(race="b")
# Step 3: Shade and set background
img = tf.shade(agg_b, cmap=fire, how="eq_hist")
img = tf.set_background(img, "black")
img
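The sel() call above slices along the named "race" dimension by coordinate label. On a toy 3-D array the behavior is easy to see:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(8).reshape(2, 2, 2),
    dims=("y", "x", "race"),
    coords={"race": ["w", "b"]},
)
# Label-based selection: take the "b" slice along the race dimension
print(arr.sel(race="b").values)
# [[1 3]
#  [5 7]]
```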
Goal: Select pixels where each race has a non-zero count.
bool_sel = aggc.sel(race=['w', 'b', 'a', 'h']) > 0
bool_sel
<xarray.DataArray (northing: 525, easting: 900, race: 4)>
array([[[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
...,
[False, False, False, False],
[False, False, False, False],
[False, False, False, False]],
[[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
...,
[False, False, False, False],
[False, False, False, False],
[False, False, False, False]],
[[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
...,
...
...,
[False, False, False, False],
[False, False, False, False],
[False, False, False, False]],
[[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
...,
[False, False, False, False],
[False, False, False, False],
[False, False, False, False]],
[[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
...,
[False, False, False, False],
[False, False, False, False],
[False, False, False, False]]])
Coordinates:
* easting (easting) float64 -1.388e+07 -1.387e+07 ... -7.464e+06 -7.457e+06
* northing (northing) float64 2.822e+06 2.828e+06 ... 6.326e+06 6.333e+06
* race     (race) <U1 'w' 'b' 'a' 'h'
# Do a "logical and" operation across the "race" dimension
# Pixels will be "True" if the pixel has a positive count for each race
diverse_selection = bool_sel.all(dim='race')
diverse_selection
<xarray.DataArray (northing: 525, easting: 900)>
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
Coordinates:
* easting (easting) float64 -1.388e+07 -1.387e+07 ... -7.464e+06 -7.457e+06
* northing  (northing) float64 2.822e+06 2.828e+06 ... 6.326e+06 6.333e+06
# Select the pixel values where our diverse selection criteria are True
agg2 = aggc.where(diverse_selection).fillna(0)
# and shade using our color key
img = tf.shade(agg2, color_key=color_key, how='eq_hist')
img = tf.set_background(img,"black")
img
# Select where the "b" race dimension is greater than the "w" race dimension
selection = aggc.sel(race='b') > aggc.sel(race='w')
selection
<xarray.DataArray (northing: 525, easting: 900)>
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
Coordinates:
* easting (easting) float64 -1.388e+07 -1.387e+07 ... -7.464e+06 -7.457e+06
* northing  (northing) float64 2.822e+06 2.828e+06 ... 6.326e+06 6.333e+06
# Select based on the selection criteria
agg3 = aggc.where(selection).fillna(0)
img = tf.shade(agg3, color_key=color_key, how="eq_hist")
img = tf.set_background(img, "black")
img
Let's use hvplot
# Initialize hvplot and dask
import hvplot.pandas
import hvplot.dask # NEW: dask works with hvplot too!
import holoviews as hv
import geoviews as gv
To speed up interactive calculations, you can "persist" a dask array in memory (load the data fully into memory). You should have at least 16 GB of memory to avoid memory errors, though!
If not persisted, the data will be loaded on demand to avoid memory issues, which will slow the interactive nature of the plots down slightly.
# NOTE: only run this line if you have at least 16 GB of memory
census_ddf = census_ddf.persist()
# Plot the points
points = census_ddf.hvplot.points(
x="easting",
y="northing",
datashade=True, # NEW: tell hvplot to use datashader!
aggregator=ds.count(),
cmap=fire,
geo=True,
crs=3857, # Input data is in 3857, so we need to tell hvplot
frame_width=plot_width,
frame_height=plot_height,
xlim=USA_xlim_meters, # Specify the xbounds in meters (EPSG=3857)
ylim=USA_ylim_meters, # Specify the ybounds in meters (EPSG=3857)
)
# Put a tile source behind it
bg = gv.tile_sources.CartoDark
bg * points
Note: interactive features (panning, zooming, etc) can be slow, but the map will eventually re-load!
Similar syntax to previous examples...
# Points with categorical colormap
race_map = census_ddf.hvplot.points(
x="easting",
y="northing",
datashade=True,
c="race", # NEW: color pixels by "race" column
aggregator=ds.count_cat("race"), # NEW: specify the aggregator
cmap=color_key, # NEW: use our custom color map dictionary
crs=3857,
geo=True,
frame_width=plot_width,
frame_height=plot_height,
xlim=USA_xlim_meters,
ylim=USA_ylim_meters,
)
bg = gv.tile_sources.CartoDark
bg * race_map
We can easily overlay Congressional districts on our map...
# Load congressional districts and convert to EPSG=3857
districts = gpd.read_file('./data/cb_2015_us_cd114_5m').to_crs(epsg=3857)
# Plot the district map
districts_map = districts.hvplot.polygons(
geo=True,
crs=3857,
line_color="white",
fill_alpha=0,
frame_width=plot_width,
frame_height=plot_height,
xlim=USA_xlim_meters,
ylim=USA_ylim_meters
)
bg * districts_map
# Combine the background, race map, and districts into a single map
img = bg * race_map * districts_map
img